ABSTRACT

In the mid to late 1970s, considerable research was
conducted on the properties of Rasch fit mean squares, resulting in
transformations to convert the mean squares into approximate
t-statistics. In the late 1980s and the early 1990s, the trend seems
to have reversed, with numerous researchers using the untransformed
fit mean squares as a means of testing fit to the Rasch measurement
models, citing as the principal motivation the influence sample size
has on the sensitivity of the t-converted mean squares. The
historical development of these fit indices and the various
transformations is traced, and the impact of sample size on both the
fit mean squares and the t-transformations of those mean squares is
examined. Because the sample size problem has little influence on the
person mean squares, owing to the relatively short length of most
tests (100 items or fewer), this paper focuses on the item fit mean
squares, where it is common to find the statistics used with sample
sizes ranging from 30 to 10,000. Simulation results indicate that the
critical value for the mean square used to detect misfit is affected
both by the type of the mean square and the number of persons in the
calibration. Four tables present analysis results. (Contains 11
references.) (SLD)
ABSTRACT

Throughout the mid to late 1970s considerable research was
conducted on the properties of Rasch fit mean squares, and this
work culminated in a variety of transformations to convert the
mean squares into approximate t-statistics. This work was
primarily motivated by the influence sample size has on the
magnitude of the mean squares and the desire to have a single
critical value that can generally be applied to most cases. In
the late 1980s and the early 1990s the trend seems to have
reversed, with numerous researchers using the untransformed fit
mean squares as a means of testing fit to the Rasch measurement
models. The principal motivation cited is the influence
sample size has on the sensitivity of the t-converted mean
squares. The purpose of this paper is to present the historical
development of these fit indices and the various transformations
and to examine the impact of sample size on both the fit mean
squares and the t-transformations of those mean squares. Because
the sample size problem has little influence on the person mean
squares, owing to the relatively short length of most tests (100
items or fewer), this paper focuses on the item fit mean squares,
where it is common to find the statistics used with sample sizes
ranging from 30 to 10,000.
Using Item Mean Squares to
Evaluate Fit to the Rasch Model 



Recent presentations at the Rasch Measurement SIG sessions
at AERA have stressed the use of the weighted and unweighted item
mean squares as a means to evaluate the fit of the responses to a
Rasch model. This evaluation is usually based on a single
critical value on the order of 1.2 to 1.3 for both mean squares.
The rationale usually given is that the mean square is less
affected by sample size than the approximate t-statistic
resulting from the cube root transformation of the fit mean
square. These arguments contradict the arguments used in the
late 1970s and early 1980s, when the fit mean square
transformations were developed.



History of Fit

One of the methods of assessing fit in Rasch measurement
models, and the technique that is used in most of the calibration
and analysis programs distributed by MESA Press, is based on
concatenation of the item/person residual. Other methods, such
as those based on the likelihood ratio chi-square, will not be
discussed in this paper. There is an approximately parallel
history of development for item and person fit statistics based
on the item/person residual (Smith, 1989, 1991b); however, only
the development of the item fit statistics, the object of inquiry
in this study, will be detailed here.

The item fit statistic, first proposed by Wright and
Panchapakesan (1969), was based on person raw score groups and
focused on the difference between the observed and expected score
for a group of persons with the same raw score on a test.
Subsequent developments in fit statistics have been based on the
item/person residual. The unweighted item total fit statistic
(UT) in the chi-square form, based on the standardized
item/person residual ($y_{ni}$), is

$$\chi^2(UT)_i = \sum_{n=1}^{N} y_{ni}^2 . \tag{1}$$

The standardized residual $y_{ni}$ is

$$y_{ni} = \frac{x_{ni} - p_{ni}}{\sqrt{w_{ni}}} , \tag{2}$$

where $x_{ni}$ is the observed score for each item/person
interaction, $p_{ni}$ is the probability of a correct response for
each interaction, and $w_{ni} = p_{ni}(1 - p_{ni})$. This chi-square
is calculated for each item by summing over all of the persons in
the response matrix.

This chi-square can be converted to a mean square by
dividing by the number of persons (N),

$$MS(UT)_i = \frac{1}{N} \sum_{n=1}^{N} y_{ni}^2 . \tag{3}$$

Note that the degrees of freedom used to convert these and
subsequent total fit statistics to mean squares are N rather than
the (N-1) used with the Wright-Panchapakesan statistic. This is
due to the fact that (N-1) overcorrects for the loss in degrees of
freedom due to using the same $x_{ni}$ to estimate the item and
person parameters used in calculating the $p_{ni}$ and to calculate
the score residual. Alternative methods for correcting for the
loss in degrees of freedom are discussed in Smith (1982, 1991b).

The standard deviation of this mean square can be estimated
by

$$S[MS(UT)_i] = \frac{\left[ \sum_{n=1}^{N} \frac{1}{w_{ni}} - 4N \right]^{1/2}}{N} . \tag{4}$$

These statistics originally were evaluated as fit mean squares
(FMS) in BICAL, an early Rasch calibration program. MS(UT)
has an expected value of one and the standard deviation given in
Equation 4. The critical values for detecting misfit with this
mean square depend on the number of persons and the $w_{ni}$, so
they will vary from item to item and sample to sample. To
simplify the critical value problem, the mean square can be
standardized to an approximate unit normal by a variety of
transformations. The unweighted total item fit statistic is
discussed in Wright and Stone (1979).
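
The calculation of the unweighted mean square and its standard deviation for a single item can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the person and item measures are treated as known rather than estimated, and the standard deviation is assumed to take the usual dichotomous form, the square root of (sum of 1/w_ni minus 4N), divided by N.

```python
import math

def rasch_p(theta, delta):
    """Rasch probability of a correct response (person theta, item delta, in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def unweighted_fit(responses, abilities, difficulty):
    """Return (MS(UT), S[MS(UT)]) for one item.

    responses  -- 0/1 scores x_ni, one per person
    abilities  -- person measures in logits (assumed known in this sketch)
    difficulty -- item measure in logits (assumed known in this sketch)
    """
    n = len(responses)
    sum_y2 = 0.0      # sum of squared standardized residuals
    sum_inv_w = 0.0   # sum of 1 / w_ni
    for x, theta in zip(responses, abilities):
        p = rasch_p(theta, difficulty)
        w = p * (1.0 - p)
        sum_y2 += (x - p) ** 2 / w
        sum_inv_w += 1.0 / w
    ms = sum_y2 / n
    # Assumed form of the standard deviation; max() guards against rounding.
    sd = math.sqrt(max(0.0, sum_inv_w - 4.0 * n)) / n
    return ms, sd
```

Because 1/w_ni grows rapidly as a person's ability moves away from the item difficulty, the standard deviation differs from item to item and from sample to sample, which is why no single critical value for the mean square follows from it.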

Later versions of BICAL introduced a log transformation in
an attempt to standardize the fit statistics to an approximate
unit normal distribution. In this transformation

$$t = \left[ \ln(MS(UT)_i) + MS(UT)_i - 1 \right] \left( \frac{f}{8} \right)^{1/2} , \tag{5}$$

where f is (N-1) for the unweighted total item fit statistic.
These transformations were introduced because the values of the
mean squares which indicate possible misfit varied from item to
item and analysis to analysis depending on the number of persons,
the distribution of item difficulties, and the distribution of
person abilities.

The last version of BICAL introduced a cube root
transformation to convert MS(UT) to approximate unit normals. In
this transformation

$$t = (MS^{1/3} - 1)\left( \frac{3}{S} \right) + \left( \frac{S}{3} \right) , \tag{6}$$

where S is the standard deviation of MS(UT) or MS(UB) given above
in Equation 4.
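
The two transformations can be written as small Python functions. This is a sketch under stated assumptions: the function names are hypothetical, the log transformation is taken to be [ln(MS) + MS - 1] multiplied by the square root of f/8, and the cube root transformation is taken to be (MS^(1/3) - 1)(3/S) + S/3.

```python
import math

def log_t(ms, f):
    """Log transformation of a fit mean square with f degrees of freedom."""
    return (math.log(ms) + ms - 1.0) * math.sqrt(f / 8.0)

def cube_root_t(ms, sd):
    """Cube root transformation of a fit mean square with standard deviation sd."""
    return (ms ** (1.0 / 3.0) - 1.0) * (3.0 / sd) + sd / 3.0

# The same mean square of 1.2 with two illustrative (hypothetical) values of S:
# a large S, as expected with few persons, and a small S, as expected with many.
t_large_s = cube_root_t(1.2, 0.18)   # about 1.10
t_small_s = cube_root_t(1.2, 0.06)   # about 3.15
```

The example shows why a fixed mean square does not correspond to a fixed t: as S shrinks with increasing sample size, the same mean square of 1.2 maps to a progressively larger t.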

Experience with these unweighted fit statistics indicated
that when there was a large range of item difficulties and person 
abilities, unexpected correct responses by low ability persons to 
difficult items and unexpected incorrect responses by high 
ability persons to easy items affected the unweighted mean square 
severely. A relatively small number of anomalous responses can 
result in unusually large mean squares and t-statistics. 

The last version of BICAL also introduced the weighted
version of the total item fit statistic, which replaced the
unweighted version in that program. The weighted item total fit
statistic was developed to diminish the effect of anomalous
outliers. In this statistic the squared standardized residual
($y_{ni}^2$) is weighted by the information function ($w_{ni}$). The
weighted item total fit statistic (WT), expressed as a mean
square, is

$$MS(WT)_i = \frac{\sum_{n=1}^{N} w_{ni} y_{ni}^2}{\sum_{n=1}^{N} w_{ni}} . \tag{7}$$

The weighted total mean square is the sum of the weighted squared
standardized residuals divided by the sum of the weights. The
standard deviation of this statistic is

$$S[MS(WT)_i] = \frac{\left[ \sum_{n=1}^{N} \left( w_{ni} - 4 w_{ni}^2 \right) \right]^{1/2}}{\sum_{n=1}^{N} w_{ni}} . \tag{8}$$

The weighted version of the total fit statistic is less affected
by anomalous responses by persons with ability far from the
difficulty of the item. A further description of the weighted
total fit statistic can be found in Wright and Masters (1982) and
Smith (1991b).
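
A short Python sketch makes the effect of the weighting concrete. The function names are hypothetical, the response probabilities are supplied directly rather than estimated, and the standard deviation of the weighted mean square is assumed to be the square root of the sum of (w_ni - 4 w_ni^2), divided by the sum of the weights.

```python
import math

def weighted_fit(responses, probabilities):
    """Return (MS(WT), S[MS(WT)]) for one item."""
    sum_w = 0.0      # sum of the weights w_ni
    sum_res2 = 0.0   # sum of w_ni * y_ni^2, which equals (x_ni - p_ni)^2
    sum_var = 0.0    # sum of (w_ni - 4 * w_ni^2)
    for x, p in zip(responses, probabilities):
        w = p * (1.0 - p)
        sum_w += w
        sum_res2 += (x - p) ** 2
        sum_var += w - 4.0 * w * w
    ms = sum_res2 / sum_w
    sd = math.sqrt(max(0.0, sum_var)) / sum_w   # assumed form
    return ms, sd

def unweighted_ms(responses, probabilities):
    """Unweighted mean square for the same responses, for comparison."""
    total = sum((x - p) ** 2 / (p * (1.0 - p))
                for x, p in zip(responses, probabilities))
    return total / len(responses)

# Four well-targeted persons plus one unexpected correct response from a
# person with only a 0.02 probability of success (illustrative values).
x = [1, 0, 1, 0, 1]
p = [0.5, 0.5, 0.5, 0.5, 0.02]
infit_ms, infit_sd = weighted_fit(x, p)
outfit_ms = unweighted_ms(x, p)
```

In this example the single anomalous response contributes a squared standardized residual of 49 to the unweighted statistic but less than 1 to the weighted numerator, so the unweighted mean square is far larger than the weighted one.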

In recent programs, e.g., BIGSCALE, BIGSTEPS, and FACETS, 
the unweighted fit statistics (item and person) have become known 
as OUTFIT statistics and the weighted fit statistics have become 
known as INFIT statistics. 

This study was designed to illustrate the differences
between the fit mean squares and the transformed version of the
item fit statistics. This comparison focused on the use of a
single critical value to determine misfit and on the effect of
sample size and the type of statistic being evaluated (OUTFIT vs.
INFIT) on the distribution of the item fit mean square. The Type I
error rates of the fit mean square are then compared with those
of the transformed t-statistic.



Methods 

In this study 100 replications of simulated data were
generated under six different conditions which varied the number
of persons and the number of items. These conditions were: 150
persons with 20 and 50 item tests, 500 persons with 20 and 50
item tests, and 1000 persons with 20 and 50 item tests. Person
abilities were normally distributed with a mean of 0 and a
standard deviation of 1. Item difficulties were uniformly
distributed from -2.0 to +2.0 logits (see Schumacker, Smith, and
Bush (1994) for a complete description of the simulated data).
All simulated data sets were calibrated with the BIGSTEPS program
(Wright and Linacre, 1991). For each calibration an item file was
generated which contained the weighted and unweighted mean squares
and t-statistics for each of the items in that calibration. The
mean, standard deviation, minimum value, maximum value, and per
cent of cases above given critical values were calculated for each
of four statistics, weighted mean squares and t-statistics and
unweighted mean squares and t-statistics, in each data set.
These summary statistics were then averaged across the 100
replications in each combination of test length and number of
persons. The critical values used to calculate the percent of
cases with extreme values were fms>1.3, fms>1.2, fms>1.1, fms<.9,
fms<.8, and fms<.7 for the mean squares and t>+4, t>+3, t>+2,
t<-2, t<-3, and t<-4 for the t-statistics.
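
A simplified version of one replication can be sketched in Python. It is not the procedure used in the study: the sketch computes the mean squares from the generating parameters instead of calibrating the data with BIGSTEPS, it spaces the item difficulties evenly between -2.0 and +2.0 logits, and all function names are hypothetical. Its results should therefore not be expected to reproduce the tables.

```python
import math
import random

def simulate_fit(n_persons, n_items, seed=0):
    """Simulate one Rasch data set; return a list of (weighted MS, unweighted MS) per item."""
    rng = random.Random(seed)
    # Person abilities drawn from a normal (0, 1) distribution.
    abilities = [rng.gauss(0.0, 1.0) for _ in range(n_persons)]
    # Assumption: difficulties evenly spaced from -2.0 to +2.0 logits.
    difficulties = [-2.0 + 4.0 * i / (n_items - 1) for i in range(n_items)]
    results = []
    for delta in difficulties:
        sum_res2 = 0.0   # sum of squared raw residuals (weighted numerator)
        sum_y2 = 0.0     # sum of squared standardized residuals
        sum_w = 0.0      # sum of the weights
        for theta in abilities:
            p = 1.0 / (1.0 + math.exp(delta - theta))
            x = 1 if rng.random() < p else 0   # simulated response
            w = p * (1.0 - p)
            sum_res2 += (x - p) ** 2
            sum_y2 += (x - p) ** 2 / w
            sum_w += w
        results.append((sum_res2 / sum_w, sum_y2 / n_persons))
    return results

def percent_above(values, critical):
    """Per cent of values exceeding a critical value."""
    return 100.0 * sum(v > critical for v in values) / len(values)

if __name__ == "__main__":
    fits = simulate_fit(150, 20, seed=1)
    infit = [w for w, _ in fits]
    outfit = [u for _, u in fits]
    print("% weighted > 1.2:  ", percent_above(infit, 1.2))
    print("% unweighted > 1.2:", percent_above(outfit, 1.2))
```

Repeating the call with different seeds and averaging the percentages over 100 replications would mirror the summary procedure described above, subject to the simplifications already noted.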



Results 

The results presented in the following tables are based on a 
summary of the 100 replications for each of the six conditions. 
The mean squares and t-statistics used in this analysis were 
obtained from the item file option available in the BIGSTEPS 
program. The summary information for the weighted mean squares
is presented in Table 1, and that for the unweighted mean squares
in Table 2.

The means of both mean squares (unweighted and weighted) are 
very stable about the expected value of 1.00. The average 
weighted mean square means have a standard deviation of 0.00 
across the six conditions, and the unweighted mean square means 
have a maximum standard deviation of 0.03 across the six 
conditions. The number of persons and the length of the test have
a small influence on the mean of the unweighted mean squares, and
any influence on the mean of the weighted mean squares cannot be
seen in the second decimal place.

The standard deviation of the mean squares varies
considerably based on the type of mean square (weighted or
unweighted) and the number of persons. The mean standard
deviation for the unweighted mean squares is approximately double
that of the weighted mean square. For example, the mean standard
deviation for the unweighted mean square varies from 0.18 with
150 persons to 0.06 for 1000 persons. The mean standard
deviation of the weighted mean square varies from 0.08 for 150
persons to 0.03 for 1000 persons. The standard deviation does
not appear to be affected by the number of items.

The range of the mean squares is similarly affected. The 
mean range for the unweighted mean square is 0.72 for 150 persons 
and 20 items, 0.40 for 500 persons and 20 items, and 0.25 for 
1000 persons and 20 items, but the number of items on the test 
has little effect on the range of the unweighted mean square. 
Contrast this with the range of the weighted mean square. In the 
same example given above, the mean range for the weighted mean 
square is 0.29 for 150 persons and 20 items, 0.16 for 500 persons 
and 20 items, and 0.10 for 1000 persons and 20 items. These are
less than one-half of the range for the unweighted mean squares. 
As with the unweighted mean square, there appears to be 
considerable influence resulting from the number of persons and 
little influence resulting from test length on the range of the 
mean squares. 

To examine the Type I error rates and the influence of mean
square type, number of persons, and test length, six critical
values were chosen and the percentage of mean squares exceeding
those values was calculated. These results are presented in
Table 3. For the weighted mean square, values greater than 1.2, a
commonly used value for detecting measurement disturbances,
occurred about 1 time per 200 or less for all sample sizes and
test lengths, and values greater than 1.1 occurred less than 1
time per 200 for sample sizes greater than 500. If a weighted
mean square critical value of 1.2 were to be used, then the Type I
error rate would approximate .005. With the weighted mean squares
the per cent greater than the critical value is too small in most
cases to judge the effect of test length on the statistic.

For the unweighted mean square and a sample size of 150,
values greater than 1.3 occurred at a rate of approximately 4 per
cent. For a sample size of 500, values greater than 1.3 occurred
at a rate of approximately 1 per cent. For a sample size of 1000,
values greater than 1.3 occurred at a rate of approximately .1
per cent. To have a consistent Type I error rate of
approximately .04, a critical value of 1.3 would be needed with
150 person samples, 1.2 with 500 person samples, and 1.1 with
1000 person samples. It is also clear from these data that the
unweighted mean square is moderately affected by test length, with
the per cent above the critical value approximately 1 per cent
higher for the 20 item tests than for the 50 item tests.
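
The choice of a sample-size-specific critical value can be illustrated with the unweighted 20 item results from Table 3. The following Python sketch simply picks, for each sample size, the tabled critical value whose observed exceedance rate lies closest to a 4 per cent target; the function name and data layout are hypothetical.

```python
def closest_critical_value(rates_by_cutoff, target_percent):
    """Return the cutoff whose per cent of exceedances is nearest the target."""
    return min(rates_by_cutoff,
               key=lambda cutoff: abs(rates_by_cutoff[cutoff] - target_percent))

# Per cent of unweighted mean squares above each cutoff (Table 3, 20 item tests).
unweighted_20_items = {
    150:  {1.3: 4.75, 1.2: 10.05, 1.1: 21.40},
    500:  {1.3: 1.35, 1.2: 3.90,  1.1: 12.50},
    1000: {1.3: 0.05, 1.2: 0.65,  1.1: 4.85},
}

chosen = {n: closest_critical_value(rates, 4.0)
          for n, rates in unweighted_20_items.items()}
# chosen == {150: 1.3, 500: 1.2, 1000: 1.1}
```

The selected values fall from 1.3 to 1.1 as the number of persons rises from 150 to 1000, which is the pattern described in the text.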

It is also clear from the values listed in Table 3 that the
mean square is not symmetrically distributed about 1.0. Extreme
values occur far less frequently below 1.0 than above. This
means that symmetrical critical values for detecting misfit would
operate at different Type I error rates for the upper and lower
tails of the distribution.



The results of these simulations suggest that no single
critical value will work with both weighted and unweighted mean
squares. It is also clear that no single value will work with
different sample sizes. If a critical value of 1.2 were chosen,
the actual Type I error rate could vary anywhere from 0.00001 to
0.10 depending on the set of circumstances.

In an effort to contrast the use of the mean square with the
transformed t-statistic, the frequency of extreme values for the
same simulations was calculated. These are presented in Table
4. In this table the critical values chosen were +4, +3, +2, -2,
-3, and -4. There is no equivalence implied between these values
and the values chosen for use in Table 3. They are simply
convenient numerical values. The +2.0 value is often used as an
indication of misfit with the t-statistic. As is clear from this
table, the Type I error rate for the unweighted t-statistic is
approximately twice the value for the weighted version. However,
the differences across the weighted and unweighted versions of the
t-statistic are less extreme than across the two versions of the
mean square. For example, for 150 persons and 20 items the Type I
error rate for the unweighted t-statistic value of +2.0 is 0.026;
for the weighted t-statistic the value is 0.0135. For the mean
square with 150 persons and 20 items the Type I error rate for a
value of 1.2 is 0.006 for the weighted version and 0.10 for the
unweighted version. This difference is far greater than with the
t-statistic. Also, the differences across sample size are less
drastic with the t-statistic than with the mean square. The Type
I error rates for the unweighted t-statistic critical value of
+2.0 with 150 and 1000 persons are 0.026 and 0.014, a multiple of
about 2. The Type I error rates for the unweighted mean square
critical value of 1.2 with 150 and 1000 persons are 0.10 and
0.0065, a multiple of about 15.
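
The two multiples can be checked directly from the tabled percentages for the 20 item tests. The short Python sketch below is only an arithmetic check; the variable names are hypothetical.

```python
# Per cent of unweighted statistics above the critical value, 20 item tests.
t_above_2 = {150: 2.60, 1000: 1.40}      # t > +2.0 (Table 4)
ms_above_1_2 = {150: 10.05, 1000: 0.65}  # mean square > 1.2 (Table 3)

t_multiple = t_above_2[150] / t_above_2[1000]            # roughly 1.9
ms_multiple = ms_above_1_2[150] / ms_above_1_2[1000]     # roughly 15.5

print(round(t_multiple, 1), round(ms_multiple, 1))
```

Moving from 150 to 1000 persons roughly halves the Type I error rate of the t-statistic but reduces that of the mean square by a factor of about 15.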

Although it is clear from these simulations that the use of
a single critical value for the t-statistic may lead to different
Type I error rates for different statistics, sample sizes, and
test lengths, the effect of these three factors on the statistics
is less than that observed for the mean squares. It should be
noted that Smith (1982, 1991b) has proposed several methods
for removing the differences found across fit statistics due to
the differences in sample size and type of statistic. The values
reported in this study were generated by BIGSTEPS, which does not
employ these corrections. If these corrections were employed,
the dissimilarity between the Type I error rates for the
t-statistics would be less than that observed here.

Discussion 

Clearly these results indicate that the critical value for
the mean square used to detect misfit is affected both by the
type of the mean square and the number of persons in the
calibration. A single critical value, particularly one of 1.2 or
1.3, will not give .05 Type I error rates for sample sizes of 500
or larger. For the weighted version (INFIT) even a value of 1.1
is too large for sample sizes of more than 500. These results have
serious implications for BIGSTEPS users, since the item fit mean
squares have become the preferred method by which the fit of
the data to the model is determined. Many authors suggest that
the mean square is less sensitive to large sample size than the
t-transformation. These results show that this is not the case.
The mean squares are more sensitive to sample size, and reliance
on a single critical value for the mean square can result in
underdetection of misfit.
Table 1

Weighted Mean Square Descriptive Statistics

Simulation 1 (150 persons, 20 items)

            Mean    S.D.    MIN     MAX
Mean        1.00    0.00    0.99    1.01
S.D.        0.08    0.01    0.05    0.11
Maximum     1.15    0.04    1.08    1.33
Minimum     0.86    0.03    0.81    0.92

Simulation 2 (500 persons, 20 items)

            Mean    S.D.    MIN     MAX
Mean        1.00    0.00    0.99    1.00
S.D.        0.04    0.01    0.03    0.06
Maximum     1.08    0.02    1.04    1.15
Minimum     0.92    0.02    0.87    0.96

Simulation 3 (1000 persons, 20 items)

            Mean    S.D.    MIN     MAX
Mean        1.00    0.00    0.99    1.00
S.D.        0.03    0.00    0.02    0.04
Maximum     1.05    0.01    1.03    1.09
Minimum     0.95    0.01    0.91    0.97

Simulation 4 (150 persons, 50 items)

            Mean    S.D.    MIN     MAX
Mean        1.00    0.00    0.99    1.00
S.D.        0.07    0.01    0.05    0.09
Maximum     1.16    0.03    1.09    1.25
Minimum     0.85    0.03    0.75    0.91

Simulation 5 (500 persons, 50 items)

            Mean    S.D.    MIN     MAX
Mean        1.00    0.00    1.00    1.00
S.D.        0.04    0.00    0.03    0.05
Maximum     1.09    0.02    1.05    1.16
Minimum     0.92    0.02    0.86    0.95

Simulation 6 (1000 persons, 50 items)

            Mean    S.D.    MIN     MAX
Mean        1.00    0.00    1.00    1.00
S.D.        0.03    0.00    0.02    0.03
Maximum     1.06    0.01    1.04    1.11
Minimum     0.94    0.01    0.91    0.96
Table 2

Unweighted Mean Square Descriptive Statistics

Simulation 1 (150 persons, 20 items)

            Mean    S.D.    MIN     MAX
Mean        1.00    0.03    0.95    1.11
S.D.        0.18    0.07    0.10    0.53
Maximum     1.45    0.31    1.17    3.17
Minimum     0.73    0.06    0.58    0.86

Simulation 2 (500 persons, 20 items)

            Mean    S.D.    MIN     MAX
Mean        1.00    0.01    0.97    1.04
S.D.        0.10    0.03    0.05    0.19
Maximum     1.25    0.13    1.05    1.80
Minimum     0.85    0.04    0.73    0.91

Simulation 3 (1000 persons, 20 items)

            Mean    S.D.    MIN     MAX
Mean        1.00    0.01    0.98    1.02
S.D.        0.06    0.01    0.03    0.10
Maximum     1.14    0.06    1.03    1.34
Minimum     0.89    0.03    0.80    0.95

Simulation 4 (150 persons, 50 items)

            Mean    S.D.    MIN     MAX
Mean        1.00    0.01    0.98    1.05
S.D.        0.16    0.03    0.11    0.35
Maximum     1.52    0.28    1.17    3.19
Minimum     0.71    0.05    0.58    0.81

Simulation 5 (500 persons, 50 items)

            Mean    S.D.    MIN     MAX
Mean        1.00    0.01    0.98    1.02
S.D.        0.08    0.02    0.06    0.16
Maximum     1.25    0.14    1.10    1.99
Minimum     0.82    0.04    0.72    0.88

Simulation 6 (1000 persons, 50 items)

            Mean    S.D.    MIN     MAX
Mean        1.00    0.00    0.99    1.01
S.D.        0.06    0.01    0.04    0.09
Maximum     1.18    0.07    1.09    1.51
Minimum     0.87    0.03    0.79    0.92
Table 3

Mean Square Frequency of Extreme Values

Weighted

Simulation          1        2        3        4        5        6
% > 1.3          0.05     0.00     0.00     0.00     0.00     0.00
% > 1.2          0.60     0.00     0.00     0.32     0.00     0.00
% > 1.1          8.05     0.60     0.00     6.90     0.40     0.04
% < 0.9          8.35     0.65     0.00     6.62     0.20     0.00
% < 0.8          0.00     0.00     0.00     0.08     0.00     0.00
% < 0.7          0.00     0.00     0.00     0.00     0.00     0.00

Unweighted

Simulation          1        2        3        4        5        6
% > 1.3          4.75     1.35     0.05     3.48     0.52     0.12
% > 1.2         10.05     3.90     0.65     8.44     1.76     0.48
% > 1.1         21.40    12.50     4.85    20.70     8.72     4.38
% < 0.9         28.05    11.15     3.10    23.56     8.88     2.70
% < 0.8          8.30     0.50     0.00     6.36     0.48     0.04
% < 0.7          1.30     0.00     0.00     0.84     0.00     0.00

N of Persons      150      500     1000      150      500     1000
N of Items         20       20       20       50       50       50
Table 4

t-statistic Frequency of Extreme Values

Weighted

Simulation          1        2        3        4        5        6
% > 4.0          0.05     0.00     0.00     0.00     0.00     0.00
% > 3.0          0.10     0.00     0.00     0.02     0.04     0.08
% > 2.0          1.35     0.60     0.40     1.02     0.66     0.64
% < -2.0         0.65     1.70     1.10     0.70     0.70     0.90
% < -3.0         0.00     0.05     0.05     0.06     0.00     0.00
% < -4.0         0.00     0.00     0.00     0.00     0.00       --

Unweighted

Simulation          1        2        3        4        5        6
% > 4.0          0.10     0.05     0.00     0.08     0.10     0.06
% > 3.0          0.40     0.50     0.10     0.24     0.22     0.26
% > 2.0          2.60     2.45     1.40     2.24     1.80     1.56
% < -2.0         0.35     1.25     0.90     0.40     0.62     0.80
% < -3.0         0.00     0.00     0.00     0.02     0.00     0.00
% < -4.0         0.00     0.00     0.00     0.00     0.00     0.00

N of Persons      150      500     1000      150      500     1000
N of Items         20       20       20       50       50       50